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Abstract 

Smoothing in state-space models amounts to computing the con- 
ditional distribution of the latent state trajectory, given observations, 
or expectations of functionals of the state trajectory with respect to 
this distributions. For models that are not linear Gaussian or pos- 
sess finite state space, smoothing distributions are in general infeasible 
to compute as they involve intergrals over a space of dimensionality 
at least equal to the number of observations. Recent years have seen 
an increased interest in Monte Carlo-based methods for smoothing, 
often involving particle filters. One such method is to approximate 
filter distributions with a particle filter, and then to simulate back- 
wards on the trellis of particles using a backward kernel. We show 
that by supplementing this procedure with a Metropolis-Hastings step 
deciding whether to accept a proposed trajectory or not, one obtains 
a Markov chain Monte Carlo scheme whose stationary distribution is 
the exact smoothing distribution. We also show that in this procedure, 
backward sampling can be replaced by backward smoothing, which ef- 
fectively means averaging over all possible trajectories. In an example 
we compare these approaches to a similar one recently proposed by An- 
drieu, Doucet and Holcnstcin, and show that the new methods can be 
more efficient in terms of precision (inverse variance) per computation 
time. 



1 Introduction 

The topic of the present paper is computation of smoothed expectations of 
functionals of the state process in state-space models, i.e. conditional ex- 
pectations of such functionals given data. To make this discussion more 
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precise, let (X, Y) be a state-space model, where Y = (Yf-)? is the ob- 
servable (output) process and X = (^fc)fc=o is the the latent (unobserved) 
Markov chain. The relation between the two is such that given X, the Yfc's 
are conditionally independent with the conditional distribution of a partic- 
ular Yfc depending on the corresponding Xf, only. If the state space of X 
is finite we use the term hidden Markov model (HMM). Our interest thus 
lies in computing conditional expectations of the form E[/t(Xo :n ) | yo-.n] for 
a a real-valued functional h of one, several, or all of the latent variables. 
Here Xq.^ = (X , X\, . . . , Xf.) etc.; this form will be our generic notation for 
vectors. 

Conditional expectations as above are often interesting and relevant in 
their own respect, with e.g. E[X}~ | Yq :v ], E[X| | Yq- u } and ¥(Xk > x \ Yo :n ) = 
E[l(-Xfc > x) | Yo :n ], where l(-) denotes some indicator function, providing 
useful inferential summaries of the latent states. Another very common use 
of such expectations is however for inference on model parameters through 
the EM algorithm. Indeed, assume that the distribution of (X, Y) depends 
on some model parameter (vector) 6. Then in the E-step of the EM al- 
gorithm, it is typical that conditional expectations with functionals like 

h(x 0:n ) = ELo x fc> h ( x o-.n) = Efc=o x fc> H x 0:n) = Efe=i x k-i x k etc - appear, 
in particular if the joint distribution of (X, Y) belongs to an exponential 
family of distributions. For HMMs, the functional h(xo :n ) = X^/c=i ^-( x k-i = 
h x k = j) is used to re-estimate the transition probability from state i to 
state j. Unfortunately, exact numeric computation of the conditional distri- 
bution of Xk, or that of a sequence of X's, given Yo :n , is possible essentially 
only in two cases. Firstly for HMMs, for which the forward-backward algo- 
rithm 0] provides the solution, and secondly for linear Gaussian state-space 
models, for which conditional distributions are Gaussian and the Kalman 
smoothing recursions provide their conditional means and (co)variances [e.g. 



131 . Chapter 4]. For other models, i.e. models with continuous state space 



and non- linear and/or non-Gaussian dynamics and/or output characteris- 
tics, there are no exact numerical methods available and one is confined 
to using approximations. Traditional approaches include Kalman filtering 
and smoothing techniques based on linearisation of the system dynamics and 
output characteristics, such as the extended Kalman filter, but following the 
impact of Markov chain Monte Carlo (MCMC) methods in general, during 
the last 10-15 years there has been a dramatic increase in the interest in 
and use of simulation-based methods to approximate conditional expecta- 
tions given data. When such methods are used to approximate expectations 
appearing in the E-step of the EM algorithm, one often talks about Monte 
Carlo EM (MCEM) algorithms. 
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For a model as above it holds that the conditional distribution of X 
given Y is that of a time-varying Markov chain. This fact lies behind the 
existence of the forward-backward algorithm for HMMs, and is also the cor- 
nerstone of any algorithm that simulates X conditionally on Y. To simulate 
X given Y and a fixed set of parameters 9, there are essentially two dif- 
ferent approaches. The first one, often referred to as local updating, is to 
run an MCMC algorithm that updates one at the time, given Y and 
X_k = (Xj)o<j<n,j^k- Because of the model structure, the only variables 
appearing in this conditional distribution are Y^, X^-i (unless k = 0) and 
Xk + i (unless k = n). By varying k one obtains an MCMC algorithm whose 
stationary distribution is that of X given Y [see e.g. [3]. The other ap- 
proach to simulating X given Y is simply to simulate a full trajectory from 
the conditional distribution in question. This can be done either by for- 
ward filtering-backward sampling (FFBS), or by backward recursion- forward 
sampling. These names stem from the two blocks of the forward-backward 
algorithm for HMMs, replacing either of them by simulation. In this paper 
we focus on the former approach. After recursively computing the filter, i.e. 
the conditional distribution of Xk given Y§±, for k = 0,1,..., n, forward 
filtering-backward sampling first simulates X n from the filter distribution at 
time n and then recursively simulates, for k = n — l,n — 2, . . . , 0, Xk from 
the conditional distribution of Xk given Xk+i and Yq±. 

Comparing the two approaches, the advantage of local updating is its 
simplicity as it only involves simulation from univariate (conditional) dis- 
tributions. Its disadvantage is that a significant burn-in period may be 
required to remove bias, and that mixing can be slow so that many MCMC 
iterations are required to make sure that sample means approximate the cor- 
responding conditional expectations with required accuracy. For FFBS on 
the other hand, one must compute the filter distribution. This is easily done 
for HMMs [e.g. Q, used FFBS for HMMs], but for continuous state spaces 
the filter distributions are in general not available. The foremost advantage 
of FFBS is that the simulated replications of X | Y are independent. 

To approximate filter distributions in models with continuous state space, 
a class of methods known as particle filters, or sequential Monte Carlo (SMC) 
methods, has received considerable attention during the last 10-15 years; see 
e.g. [HI, EH, 0] for introductions to such methods, and e.g. 0, 0, [20| for sur- 
veys and applications. 

Particle filters approximate the filter distribution at time A: by a discrete 
distribution X^i^l^f") f° r which the locations £|, the so-called particles, 
and the non-negative weights ul evolve randomly and recursively in time. 
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Using such an approximation one can thus recursively compute approxi- 
mations to the filter distributions, and then use them to simulate a state 
trajectory backwards. The distribution of this trajectory will then only 
approximately be that of X | Y. The idea to use particle filters for approx- 
imate backward sampling first appeared, to the best of our knowledge, in 
The paper provides theory (see e.g. Theorem 5 and Corollary 6 
therein) that supports the validity of this approach such as consistency re- 
sults ensuring that, as the number of particle increases, the distribution of a 
trajectory sampled using the particle filter converges to the true smoothing 
distribution. 

A recent paper, [H, devised a method related to FFBS but that removes 
bias entirely by adding a Metropolis-Hastings (M-H) step. The approach 
is described in detail below, but in short it involves running a particle fil- 
ter and then selecting a state trajectory not by backward sampling, but by 
sampling one of the particles at the final time-point n according to its im- 
portance weight, and then tracing the history of this particle back to the 
first time-point. The M-H step is constructed so that the stationary dis- 
tribution of the sampled trajectory is indeed the distribution of X| Y. A 
well-known problem of particle filters is however that as the filter recursion 
proceeds beyond a given time-point k, after a while only a few of the par- 
ticles that existed at k will have survived the step- wise selection process. 
This implies that the genealogical tree of the particle filter provides a poor 
approximation to the smoothing distribution at time-points just a bit prior 
to current time. FFBS does not suffer from this kind of degeneration, and 
in the present paper we show how an M-H step can be applied to remove 
bias also from particle FFBS. We also show how the approach can be Rao- 
Blackwellised, by which we mean that sampling of trajectories is replaced 
by the corresponding expectation, which is the backward smoothing recur- 
sion. As a compromise between single trajectory sampling and smoothing 
one may also simulate a small number of trajectories from each set of par- 
ticles. We analyse the various approaches from a variance/cost perspective, 
and show that backward simulation and smoothing can be notablty more 
efficient than sampling from the genealogical tree. 

We started the work on the material presented in this paper when we 
during the writing of the manuscript [l6f| , on approximate data augmen- 
tation MCMC schemes using particle FFBS, became aware of a preprint 
version of [H. Later, in the discussion part following the published version 
of that paper (p. 306-307), we found that Nick Whitley (University of Bris- 
tol) had been thinking along similar lines. Therefore we would like to point 
out some features of our paper that are not found in Whiteley's comment, 
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nor in the paper [l[ itself. One such feature is that we allow for general 
auxiliary particle filters in the MCMC sampler, and another one is the mul- 
tiple trajectory sampling idea that compromises between single trajectory 
sampling and backward smoothing, as well the variance/cost analysis of this 
and other approaches. 

2 Preliminaries 

In this section we sharpen the notation and introduce some basic concepts 
that will be used throughout the paper. We assume that all random variables 
are defined on a common probability space (f2,3~, P). The state space of X 
is denoted by X, and by y we denote the space in which Y takes its values. 
We suppose that both of these spaces are Polish and write 23 (X) and 23 (V) 
respectively for the corresponding Borel <r-fields. The transition kernel and 
initial distribution of X are denoted by Q and p, respectively, and we assume 
that the transition kernel Q admits a density q w.r.t. some fixed reference 
measure measure A on X, in the sense that 



for all A 6 23 (X) and x £ X. We also assume that the conditional distribution 
of Yj; given = x has a density g(x, •) (the emission density) w.r.t some 
reference measure v. In most applications X and y are products of M, and 
A and v are Lebesgue measures. Here we have tacitly assumed that neither 
Q nor g depends on time k, but the extension to time- varying systems is 
immediate. 

We will throughout the paper assume that we are given a fixed record 
■jjo-n of arbitrary but fixed observations and our main target is to produce 
samples from the joint posterior distribution 4> r . s \ n (A) = f F(X r:s £ A \ Yq ]TI = 
yo-.n), A G 23(X)®^ _r+1 - ) , of a record of states given the observations. The 

def 

special cases (ftk = ^k-.k^ an d <fto-.n\n wm De referred to as the filter and 
joint smoothing distributions, respectively. Since the model is fully dom- 
inated, each joint smoothing distribution ^> 0: fc|fc nas a density (denoted by 
the same symbol) w.r.t. products of A and v. This density is proportional 
to p(xo)gQ(xo) Ylg—i ge(xe)q(xe_i, xg) and we denote by the normalising 
constant. Since the observations are fixed we will keep the dependence 
of any quantity on these implicit and introduce the short-hand notation 

9k(x) d = g{x,y k ) for x e X. 
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It is easily shown that the joint smoothing distributions (</>o:fc|fc)fc=o sa ^" 
isfy the well known forward smoothing recursion 

a t a\ II %A(xo:k)9k(xk)Q(xk-i,dx k )<t> . k - 1 \ k - 1 (dx 0:k - 1 ) 

(PV:k\k\ A ) - FT f T7T7 1 r~7 73 1 , I 2 - 1 ) 

J J 9k{X k ) Q{X k -l, dx k ) O :fc-l|fc-l(«^O:fc-l) 

implying the analogous recursion 

,j A , _ H tA(xk)gk(xk) Q(x k _i,dx k ) <t> k -i{dx k -i) , 2 ^ 

// 9k(x k ) Q(x k -i,dx k ) (f) k -i(dx k -i) 

for the filter distributions. Conversely, the joint smoothing distributions may 
be retrieved from the filter distribution flow using the so-called backward 
decomposition of the smoothing measure. Indeed, let, for A G 23 (X), 

t? (x' A) = /MaOgfosQ/^aQ ^ ^ 

M ' J q(u, x') n(du) 

be the reverse kernel associated with Q and fx, where fj, is a probability 
measure on X. In particular, by letting fi be some marginal distribution of 
X we obtain the transition kernel of X when evolving in reverse time. Using 
(|2.3p . the joint smoothing distribution 4>o- n \ n may be expressed as 

/„ n—l 
••• / Ia(^o^i) (f>n(dx n ) Y[ Q<p k {x k+1 ,dx k ) (2.4) 

k=0 

for A e S(X)®( n+1 ) @, Corollary 3.3.8]. Here (t^J^ are the so-called 
backward kernels describing transitions of X when evolving backwards in 
time and conditionally on the given observations. Consequently, a draw Xg :n 
from (j)Q- n \ n may be produced by first computing recursively, using f|2.2j) . the 
filter distributions (<f> k ) k=0 (the forward filtering pass), simulating X n from 
<^n, and then simulating, recursively for k = n — 1, n — 2, . . . , 0, X k from 
(Xfe+i, •) (the backward simulation pass). This is the mentioned FFBS 
algorithm. 

As stressed in the introduction, joint smoothing distributions can be ex- 
pressed on closed form only for a very few models. The same applies for any 
marginals of the same, including the filter distributions, and thus the de- 
composition (|2.4p appears, at a first glance, to be of academic interest only. 
However, while there is a well-established difficulty of applying SMC meth- 
ods directly to the smoothing recursion (|2.ip (as resampling systematically 
the particle trajectories decreases rapidly the number of distinct particle 
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coordinates at early time steps; see Section [37L]) , SMC methods may be effi- 
ciently used for approximating the filter distributions. Hence, by following 
[l2| and replacing the filter distributions in f|2.4[) by particle filter estimates, 
we obtain an approximation of the joint smoothing distribution that is not 
at all effected by the degeneracy of the genealogical particle tree. This issue 
will be discussed further in Section 13.11 



3 Algorithms 

3.1 Particle smoothing 

A particle filter approximates the filter distribution <j) k at time k by a 
weighted empirical measure 

N i 

^((fc) =£=S p^S^dx), (3.1) 

i=l L/£=l W fc 

where (CL w l)i=i is a weighted finite sample of so-called particles (the £fe' s ) 
with associated importance weights (the uj l k s) and 6^ denotes a unit point 
mass at £. We remark that the weights {uj k )f =l are not normalised, i.e. 
required to sum to unity, which motivates the self-normalisation in (|3.ip . 

Based on an approximation as above at time k, an approximation of 
(pk+i can be obtained in different ways; however, two specific operations 
are common to all SMC algorithms: selection, which amounts to dropping 
particles that have small importance weights and duplicating particles with 
larger weights, and mutation, which amounts to randomly moving the par- 
ticles in the state space X. The approach we describe below is called the 



auxiliary particle filter 171 ]. 



Given the ancestor sample {^ k ,ui k )f =l , one iteration of the auxiliary par- 
ticle filter involves sampling the auxiliary distribution 



dcf u l k $ g k+1 {x)Q{C k ,dx) ( f t A (x)g k+1 {x)Q(C k ,dx) \ 
' Etx4l9k+i(x)Q(4,dx) V fg k+ i(x)Q(a,dx) J 

on the product space X x {1, ... , N}, using some proposal distribution 
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where R k is a proposal kernel on X and ($1)^ is a set of adjustment 
multiplier weights. As a motivation for this we note that ^l k + i(W x 

A) is the mixture distribution obtained by simply plugging the weighted 
empirical measure (|3.ip into the filtering recursion f)2.2j) ; thus, by simulating 
a set of particle positions and indices from (|3.ip and discarding the latter, 
a sample of particles approximating (ftk+i is obtained. This procedure may 
then be repeated recursively as new observations become available in order 
to obtain weighted particle samples approximating the filter distributions at 
all time points. We will throughout this paper assume that the adjustment 

■ dcf 

multiplier weights are generated from the ancestor sample according to $i = 
where ~d k : X — > R + is weight function. In addition we will assume 
that the proposal kernel has a transition density r k : X 2 — > R + with respect 
to A. The latter implies that also n^+i nas a density, which we denote by the 
same symbol, on {1, ... , TV} x X. In practice a draw from II is produced 
by first drawing an index I = i with probability proportional to and 
then simulating a new particle location £ from the measure R k (£l,dx). Each 
of the draws (£jL +1 , I l k+ i)f=i from II^ fl is assigned the importance weight 



where iOk+i(') ■ X 2 — > M + is the importance weight function given by 

/ / \ def 9k+l{x') q(x,x') 

u k+1 (x,x) = . (3.2 

$ k (x) r k (x,x') 

Finally, since the original target distribution is the marginal of $ k+ i with 
respect to the particle position, a weighted sample approximating the former 
is obtained by discarding the indices P k+1 and returning (£ k+1 , io l k+l )f =l . 

The scheme is initialised by drawing (£q)^Li independently from some 
initial instrumental distribution po on (X, 'B(X)) and assigning each of these 

initial particles the importance weight ujq = f cooiQ) where, for x £ X, 

uo(x) d = g {x)dp/dp (x). 

Under suitable conditions the approximation cj) k is is consistent in the 
sense that, as N tends to infinity, 

<$ (h) A <f> k (h) , 

for all (^-integrable target functions h [see^, for some convergence results on 
the auxiliary particle filter]. In addition, as a by-product, an asymptotically 
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consistent estimate of the normalising constant Z n can be obtained as 

1 /n-l N \ N 

4""^nE»«E«i' <*■») 

We remark that as a 'particle one may view not only the actual position 
at time k, but also the whole trajectory (£ 1 , . . . , w here 
the indices (G\.)^Zq of the genealogical path are defined recursively back- 

Qi 

wards through G % k _ x = I k k with G l n = i, of positions that led up to this 
current position. The particle filter may thus be used not only to approxi- 
mate the filter distribution fa, but also to approximate the joint smoothing 
distribution <^>o:fc|fc by viewing the trajectory associated with as a draw 
from this distribution. The set of all such histories is often referred to as the 
genealogical tree. The problem with this approach, in its basic form, is that 
for time-points k smaller than n, the particles will tend to originate 

from the set small set of ancestors at time k. This problem is known as 
degeneration of the genealogical tree, and typically it happens that for k 
small enough, all particles alive at n originate from the same single particle 
at time k. The conclusion is that drawing particle trajectories ending with 
will thus produce a poor estimate of the smoothing distribution 4>k-.k\n f° r 
k just a bit smaller than n, as there are in practice only a small collection 
of particles being sampled at that time-point. Backward sampling, to be 
described in the following, is a remedy to avoid this problem. 

Given a sequence (cf) k )2 =0 of filter approximations obtained in a prefa- 
tory pass with the auxiliary particle filter, a particle approximation of fa :n \ n 
may, as mentioned in Section [2l be obtained by replacing each filter distri- 
bution fa in (|2.4p by the corresponding particle estimate cf) k . This yields 
the estimator 

r. r. n— 1 

<n\niA)= ■ I l A (x :n)^(dx n ) J] Q^(x k+1 , dx k ) (3.4) 

fc=0 

for A G £(X)®( n+1 ). The estimator (J33J) was recently analysed in 0, 
establishing its convergence to fa n \ n in several probabilistic senses. By 

definition (|2.3p . each measure Q^n(x,-), x 6 X, has support at the par- 
tides (Ck)iLi om y an d the weight of each support point £|, is given by 
^kliCki x )/ YleLi w fc<z(£fc> x )- Thus the estimator <^ n | n is impractical since 
the cardinality of its support grows exponentially with n. However, a draw 
from <pQ. n i n is straightforwardly obtained using the following algorithm. 
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Algorithm 1 

(* particle-based FFBS *) 

1. run the particle filter to obtain w^.)i<i<jv,o<fe< 

2. simulate J n ~ 

3. set X* 

4. for k «— n - 1 to 

5. do simulate J k ~ X fc+l))£l 

6. set X* 

7. setX* :n ^(X*,...,X*) 

8. return XQ. n 



We note, for reasons that will be clear in the coming section, that Algo- 
rithm [1] provides, as a by-product of the forward filtering pass in Step 1, an 
estimate (given in (|3.3p ) of the normalising constant Z n . Since com- 
puting the normalising constant of the probability distribution in Step 4 
involves summing over iV terms, the overall cost of executing Steps 2-7 (i.e. 
the backward simulation pass) is Q(nN). As noted in 0], this cost can be 
reduced significantly in the case where the transition density q is bounded 
by some finite constant q+, i.e. q(x,x') < q+ for all (x,x') £ X 2 , which is 
the case for a large class of models (e.g. all non-linear models with addi- 
tive Gaussian noise). Indeed, by applying instead a standard accept-reject 
scheme where a candidate J£ is sampled from the probability distribution 
induced by the particle weights {uJ k )^L 1 (whose normalising constant is ob- 
tained as a by-product of the forward filtering pass) and accepted with 
probability q(£, k k , X k+l )/q + , the corresponding complexity can be reduced 
to 0(n). More specifically, [8|, Proposition 1] proves that the number of sim- 
ulations per index needed for obtaining, at any time step k, N indices J k of 
iV conditionally independent replicates of the backward index chain tends to 
a constant in probability. We will apply this strategy for the implementation 
in Section [H 



3.2 FFBS-based independent Metropolis-Hastings sampler 

Since the density of the smoothing distribution is known up to a normalising 
constant, state-space models can be perfectly cast into the framework of the 
Metropolis-Hastings algorithm. When applied to smoothing in state-space 
models, the output of the M-H algorithm is a Markov chain (Xq^)^>o on 
X n+1 with the following dynamics. Given X^ n , a candidate XQ. n for X^ 1 ^ 
is produced by simulation according to XQ. n ~ k n (X^, •), where k n is some 
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proposal kernel on X n+1 ; after this, one sets 



y(M-i) 

A 0:n 



X * :n w.pr.a n (X^,X *J = lA 



Xq) u otherwise. 



:n'^0:n) 



(3.5) 

The initial trajectory Aq ^ may be chosen arbitrarily. The M-H algorithm 
above admits 4>o :n \ n as stationary distribution, and under weak additional 

assumptions (such as Harris recurrence), (-X"^n)^>o converges in distribution 
to 4>o:n\ n [ see e -g- (3 f° r details]. In order to obtain an acceptance rate 
close to one, one should aim to simulate the candidates from a proposal 
distribution that is as close to 4>o- n \ n as possible. Recalling Algorithm [H a 
natural strategy is thus to generate the candidate using the particle-based 
FFBS. Indeed, with L N (XQ. n ) denoting the law of the draw Xq. u returned 



by Algorithm [TJ |l6j, Theorem 1] shows that under rather weak assumptions 



there exists a constant C n < oo such that for all N > 1, 

\\L N (XU-fo:n\n\\ Ty <C n /N 

where ||-||tv denotes total variation (distance); \\fj,— v\\tv = SU P||/|| <i 
v{f)\ for probability measures \i and v. Unfortunately, constructing an M-H 
kernel based on this proposal distribution (which is independent of the given 
Xq2i) is not possible in practice since the density of L N (XQ. n ) is infeasible 
to compute. However, the joint density of all random variables (i.e. all in- 
dices and particle locations drawn in the forward pass as well as the indices 
obtained in the backward pass) generating the output of the particle-based 
FFBS has a simple form; thus, inspired by we detour this difficulty 
by sampling instead a well chosen auxiliary target distribution on the aug- 
mented state space of all these random variables. Interestingly, it turns out 
that the acceptance ratio of the resulting independent M-H sampler, which 
is described in Algorithm [2] below, is the same as for the standard forward- 
smoothing-based algorithm particle independent M-H sampler proposed in 

a. 



Algorithm 2 

(* FFBS-based IM-H sampler *) 
Input: xjfl 

1. run Algorithm [1] to obtain X* and Z, 
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2. set Xq £ ^ -^-o-.n w ith probability 



yN,* 

( vW V* ^ 1 A 

a n(, A 0:ni A 0:nj " iA ~yW ' 

otherwise set X^ 1 ' <— X^ n . 
3. return X^ l) 

In order to derive precisely the scheme above, denote by £ fc = f . . . , 

and Ifc == (il, . . . , Ifr) the collection of all particles and indices generated 
by the particle filter at time step k > 0. Then the process (£fc,Ifc)fc>i is 
Markovian with joint law given by the density 

/ N \ / n N \ 

^o.-.cn.ii,...^)" n^) nn n ^id) , (3.6) 

v=i / \k=ie=i / 
Let again J^, k — n,n — 1, . . . ,0 denote the time-reversed index Markov 
chain of the backward smoothing pass. The joint distribution of the particle 
locations (£k)k=o an d indices (Ifc)£ =1 obtained in the forward filtering pass 
and the indices (Jk)k=o °f the backward smoothing pass is given by 

n , \\ , . . . , l n , Jo , • • • Jn ) (3-7) 

= ^« 0> ..-,€„, il, .... in) X II Stf * 

2^£=1 w n fc=i 

, in « ui jk - 1 a(P :ik - 1 f jk ) 
- ^^.•••.? n . 1 lv 1 M x jv * 11 V iV * />< rf k N ' 

where the second factor is the conditional distribution of (Jfc)jJ =0 given the 
particle locations and indices obtained in the forward filtering pass. Us- 
ing the density (|3,6p . the law of a draw produced by Algorithm Q] can be 
expressed as, for A G £(X)®( n+1 ), 

L N (XU(A)=^ k N[l A (^°, -..,&)] 

= E^[E,^[l A (C Jo , . . . ,£«)|£ , ■ ■ ■ ,£„,Ii, • ■ ■ >I»]] = %v[< n|n (.4)] . 
It turns out that the distribution targeted by Algorithm [2] is given by 

Nft t ■ • ■ ■ \ def ^0:n|n(?0°' ••• / Q o\ 
(,€o>--->?n> 1 l ) ---> 1 n>.70,---,Jn) = J^+i \ 6 '°) 

^^($ ,...^ n ,h,...,i n ) x -q Uk-MZk-vZk) 
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and the following results, whose proofs are postponed to the appendix, are 
the fundamental for the construction of this algorithm. 

Theorem 1 For any particle sample size N > 1, the distribution o/(£q°, . . . , 
under tt% is 4> 0:n \ n . 

Theorem 2 For any N > 1, the update produced by Algorithm^ is a stan- 
dard M-H update with target distribution ir^ and proposal distribution ■ 

Impose the following (standard) boundedness condition on the particle 
importance and adjustment multiplier weight functions. 



A 1 For all < k < n, \\ujk\\oo < oo and ||#&||oo < °°- 



We now have the following result. 



Theorem 3 Let Assumption [1] hold. Then there is a k E [0,1) such that 
for all I > 1 and all xo :n G X n+1 , 



l-M A 0:nl A 0:n 



I TV < « • 



This result relies on the fact that if the ratio of target to proposal den- 
sity, here Zn'* /Zn, is bounded, the independence M-H sampler converges 
geometrically [cf. |l|, p. 293]. 

Finally, by applying an Azuma-Hoeffding-type exponential inequality for 
geometrically ergodic Markov chains derived recently in [l(| we obtain, as 
a corollary of Theorem [3j the following result describing the convergence of 
MCMC estimates formed by the output of Algorithm [2j 

Corollary 4 Let Assumption [7] hold. Then for all N > 1 there exists a 
constant c > such that for all bounded measurable functions h : X n+1 — > R, 



all m > 1, all initial trajectories — XQ- n € X^ 1 , and all e > 0, 



rn+l 



1 in 



1=0 



> e ) < cexp 



m 
c 



A 



3.3 Rao-Blackwellisation and multiple trajectories 

The results above show that the acceptance probability for a proposed trajec- 
tory obtained by backward sampling does in fact not depend on the current 
or proposed trajectories themselves but only on the likelihood estimates, 
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computed from the set of particles and their importance weights, underly- 
ing the respective trajectories. Theorem 2 in [l| shows that the same holds 
true when tracing a trajectory backwards from the genealogical tree, an 
MCMC algorithm they referred to as the particle independent Metropolis- 
Hastings (PIMH) sampler. Therefore we can view the M-H sampler as one 
that proposes and possibly accepts sets of particles rather than trajectories, 
and from the current set of particles we may choose to simulate a trajectory 
either by sampling a final state and following its genealogy backwards, or 
by backward sampling. 

Denoting by Xq™ a trajectory simulated by either method using the 
current set of particles, we know that 

nHX 0:n ) | y 0:n ] = E{E[h(X^) | (£ kl u k ) < k < n ] | y 0:n }, 

where expectations are computed under the stationary distribution of the 
MCMC sampler and the notation ^jfc)o<fc<n also contains the ancestral 
history of each particle, if required. Therefore it also holds that 



nh(X .. n ) | y :n] = E < E 



(€ki u; fc)o<fc<r 



V0:n 



where now Xq^° denotes one of J trajectories sampled independently. 

Moreover, we can in principle remove sampling of trajectories altogether 
by letting J — > oo. This is equivalent to enumerating all possible sampled 
trajectories xo:n ; computing the probability v XQ . n say of that trajectory being 
sampled, and finally computing the weighted average Yl v Xo . n h(xQ :n ). When 
sampling from the genealogical tree this is possible to do, as there are only iV 
different possible trajectories (ending at positions £^ for i = 1, . . . , N). Re- 
placing sampling by averaging in this way is known as Rao-Blackwellisation. 
Section 4.6 in [1[ certainly does point this out, and it also provides con- 
vergence results for the weighted average above. For backward sampling it 
is generally not possible to work with all possible trajectories, as there are 
typically N n+1 of them, but for low-dimensional distributions of Xq :u , like 
that of a single X k or a pair (X k ,X k+ i), Rao-Blackwellisation is feasible. It 
can then be obtained by iterating the normalised weights at time n back- 
wards through the backward kernels, which is equivalent to the backwards 
pass of the forward-backward algorithm for HMMs. Thus we obtain the 
smoothing probability vi say that a sampled trajectory will pass through £L 

at time k, and we can compute the weighted average Yld=i v k^^k)- -^ or a 
pair (X k ,X k+ i), a similar computation is possible. 
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Computing smoothing probabilities obviously requires more computing 
time than does tracing a trajectory backwards as when sampling from the 
genealogical tree. However, since the tree will have low variability at time 
points away from the final time point n, averaging over such points will 
involve summing over just one or a few particles. Backward smoothing 
does not suffer from this problem, and hence we can expect better Rao- 
Blackwellisation for all time-points except for the few last ones. A com- 
promise is however also possible, namely to simulate a number, say 5-25, 
trajectories using backward sampling and computing the average over those. 
We will now take a closer look at this approach. 

Write (£ (r) ,cjM) for the set (£ fc j <^k)o<k<n °f particles and weights in 
the r-th iteration of the MCMC algorithm, and let Xq.'^\ j = 1, . . . , J, be 
J trajectories obtained from this set of particles using backward sampling, 
simulated independently. Assume for simplicity that we wish to estimate 
E[/i(Xfc) | yo-n] for some k; the discussion here generalises with only nota- 
tional changes to functionals of one than one X-variable. 



is our esti- 



Running R MCMC iterations, (1/RJ) Y^?=i E/=i K x l 
mate of E[/i(X&) | yo :n ]. To express the variance of this estimate, consider 



Var(£;X>(X 

r=l j=l 





r=l 



+ Var 



E 



+ Var 



^JVar(^I im )|(4 (r) ,o;W)) 
a 

R 



r=l 



RJ^a 2 + JR- 1 Var 
RJ{a 2 + Jal}, 



r=l 



where X sim as above denotes a generic trajectory obtained by backward 
sampling, a 2 = E[Var(/i(X^ im ) | and a 2 ^ is the limit as R — > oo of 

the normalised variance of the sum in the second last step. This limit, the 
so-called time-average variance constant (TAVC) in terminology from 0, 
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Chapter IV. 1], will exist if the MCMC sampler mixes not too slowly. Thus 
we can approximate the variance of our estimate as 

Var (ijEE^)) - ^/J + o*.). (3-9) 

\ r=l 3=1 / 

Now assume that it takes time tpf to simulate one set of particles, 
and that it takes time tbs to simulate one trajectory using backward sam- 
pling. The total computational cost for obtaining the estimate above is 
then i?(rpF + Jtbs)- If we have a total computation time r available, we 
can minimise the right-hand side of (|3.9p under the constraint that the total 
computation time is r. Treating J as a continuous variable, one finds that 
the optimal value of J is 



Jt 



opt 



ct 2 /t bs 



This expression is quite intuitive; if the variability of h(X^ im ) within a fixed 
set of particles tends to be large (a 2 is large) and sampling trajectories is 
quick (tbs is small), then we should reduce variability by drawing many 
trajectories. Likewise we should do so if variability between sets of particles 
is small (a 2 ^ is small) and it is time-consuming to generate new sets of 
particles (tpf is large). 

In practice neither of the parameters involved above are known, so they 
need to be estimated from data and run times. In the example below we 
illustrate this. 



3.4 Including a parameter 

Andrieu et al. [H, Section 4.4] devised an algorithm, referred to as the particle 
marginal Metropolis- Hastings (PMMH) update, for sampling in the case 
where a model parameter is included in the MCMC sampler's state space. 
We will now outline, briefly, that an entirely similar approach is applicable 
when trajectories are proposed using FFBS. 

Thus there is a parameter (vector) 9 in some space 0, and the transition 
density q, the emission densities gk, and the initial distribution p may all 
depend on 9. To 9 belongs a prior density (with respect to some dominating 
measure on 0), denoted by tt. The joint posterior density of 9 and XQ :n , 
which we denote by n n (9, xo :n ), is then proportional to ^{9)4>Q- n \ n {xQ- n \ 9). 

The MCMC algorithm uses a proposal density kg(-\-) say for proposing 
new values for 6, and is as follows. 
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Algorithm 3 

(* FFBS-based PMMH sampler *) 
Input: 0® and X$ n 

1. sample 9* from kg(-\9^) 

2. run Algorithm [TJ under the parameter 9*, to obtain A"* and Zn'* 

3. set (0^ +1 ),X o ( f+ 1) ) <- (0*, X *J with probability 

fc e (fl(*>|fl*)zff'* 

otherwise set (fl^ 1 ) , ) <- 

4. return (^+ 1 ),xjf+ 1) ) " 

In the same fashion an in [l[ , one may show that on an enlarged MCMC 
state space emcompassing 9, (£ , . . . , £ n ), (ii, . . . , i n ) and (jo, . . . ,jn), the 
proposal density of the sampler is 

n , \\ , . . . , l n , Jq , . . . J n , u ) 

where 9q is the current parameter and k^ (£o> ■■■■> £rn ii , - - - , in ; jo, ■ ■ ■ in! 
is as in (j3.7j) but with dependence on included, that the density targeted 
by the MCMC sampler is proportional to 

K{0)Kn (£<)) • • • > £n> il, • • • , in, jo, ■ ■ ■ Jn, 9) 

with Tr^ (£q j • • • > £n> ^1 j • • • ) hit JOi ■ ■ ■ j jn] @) as m (|3-8P but with dependence 
on # included, and that the marginal distribution of (6>,£q°, . . . ,£n n ) under 
this target density is the posterior i: n {9,XQ- n ) (cf. Theorem [T]). Moreover, 
under standard assumptions on irreducbility, the sequence {9^\X^ n ) gen- 
erated by Algorithm [3] will converge in distribution to vr n (0,xo :n j [cf- EL 
Theorem 4.4b]. 

Since the acceptance probability of Algorithm [3] does again not depend 
on the current or proposed trajectories themselves but only on the likeli- 
hood estimates, one can just as in Section 13.31 draw multiple trajectories 
from the current set of particles, or average over all of them using backward 
smoothing, to reduce the variance of sample means that approximate pos- 
terior expectations of functionals of the latent states. Also, again the same 
remark applies when trajectories are sampled backwards from the genealog- 
ical tree. 
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4 Example 



In this section we illustrate the methods developed above for a state-space 
model often referred to as the growth model, and which is a standard example 
in the particle filtering literature. The model is 

X k = \x k _ x + 25 - Xk ~\ + 8cos(1.2fc) + V n , (4.1) 

Yk = ^X 2 k + W k , (4.2) 

with Xi ~ N(0,o-g), V k ~ NID(0,<4) and W k ~ NID(0,cr^). Because of 
the square in the measurement equation (|4.2|) . the filter distributions 

for this model are in general bimodal. 

We chose parameters Oq = 5, a v = 10 and = 1 [an example also 
studied inQ]|, and N = 500 particles. We simulated a set yi^o of observa- 
tions, i.e. n = 50, and then R = 5000 sets of particles. We used the bootstrap 
filter, i.e. the filter with all adjustment multiplier weights {} % k = 1 and pro- 
posal kernel equal to the system dynamics; in other words, R k -\(x,-) was 
the Gaussian density with mean as in the right-hand side of (|4.ip and with 
variance a^y. For the bootstrap filter the importance weights u k+1 (x, x') 
simply become the emission densities g k +i(x'). The number of accepted 
proposed sets of particles was 1515, yielding an empirical acceptance ratio 
1515/i? ~ 0.30. We did not use a burn- in period at all, as the output showed 
no signs of a significant initial transient. 

With the aim of estimating K[X k \ yi :n ] for each k, we did in each sweep 
of the MCMC algorithm, i.e. for each current set of particles, 

(i) simulate one trajectory by tracing the genealogical tree backwards; 

(ii) compute the Rao-Blackwellised average, for each X k , over all N back- 
ward trajectories from the genealogical tree; 

(iii) simulate J = 25 trajectories using backward sampling; 

(iv) run the backward smoothing algorithm to compute a smoothed average 
of X k , which is the same as Rao-Blackwellising backward sampling. 

We denote these four methods by GT, GTRB,BS and BSM respectively. 
Thus GT is what it referred to as PIMH in |1[. Backward sampling was 
done using the importance sampling (IS) scheme in 0, Algorithm 1]. This 
scheme avoids computing all backward transition probabilities when extend- 
ing a trajectory one step backwards, although we did abort IS, computed 
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all transition probabilities and used them to simulate the state in question 
after 15 failed IS proposals. The average IS acceptance rate over all sets of 
particles and time-points was 16%. 

Sample averages over the R sets of particles, and, in the case of backward 
sampling, over the J simulated trajectories for each set of particles, are 
shown in Figure Q] Obviously all methods provide the same result, which 
they should, so the differences lie in the variances. 




5 10 15 20 25 30 35 40 45 50 

k 



Figure 1: Estimates of E[X^ | yi :n ] computed as sample means from the 
methods GT, GTRB, BS and BSM. The four curves overlap and are not 
distinguishable. 

For BSM we have the expression a 2 ^ BSM k /R for the asymptotic vari- 
ance, where a 2 ^ BSM k is as in Section 13.31 observe that the expression 
E[/i(X^ im ) | £( r '], with h as the identity function, is indeed the mean of 
obtained with backward smoothing. Here we also include a subindex k as 
this variance will depend on k, and also a subindex BSM as we will require 
similar variances for GT and GTRB. For BS we have the asymptotic vari- 
ance (Ofc/^ + c^o BgM k )/R as in (|3.9p . where subindex k in a\ again denotes 
dependence on time-index k. The asymptotic variances of GT and GTRB 
we write as ^ GTi /i? and a 2 ^ GTRB k /R respectively, where a 2 ^ GT k and 
a oo gtrb k are TAVCs defined similarly as a 2 ^ RB k but for one trajectory 
sampled from genealogical tree and for the weighted average over all such 
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trajectories respectively. 

In practice neither of these variances are known, and we need to estimate 
them from the simulations. We estimated a\ by first for each set of particles 
computing the sample variance of all J trajectories X s ^ m,:) obtained by back- 
ward sampling, and then computing the average of these sample variances 
over all R sets of particles. The TAVCs cx^ were estimated by summing u 
estimated autocovariances over lags \i\ < y/~R~, weighted by (l — \£\/R) [cf. 
p. 59]. Inserting these variance estimates into the expressions for asymptotic 
variances and taking square roots, yield standard errors for the respective 
estimates of ELY*. | yi m ], shown in Figure [2j 




10 15 20 25 30 35 40 45 50 



Figure 2: Standard errors of estimates of E[Xfc | yi :n ] for the methods GT 
(x), GTRB (o), BS (+) and BSM (*). 

We see that GT has standard errors larger than those of BS and BSM, 
which is to be expected as GT samples a single trajectory while BS samples 
J = 25 trajectories and BSM averages over all of them. We also see that 
the standard errors of GT and GTRB are close to identical expect towards 
the final time-point n. This is a result of the degeneracy of the genealogical 
tree as for k just a bit less than n there are only one or a few ancestors 
with descendants alive at time n, and then Rao-Blackwellisation (GTRB) 
adds little compared to just sampling (GT). For k > 43 say GTRB however 
does better, and it is on par with BSM for k > 48; for such late time-points 
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the final particles' ancestries have not coalesced and at time n GTRB and 
BSM are equivalent. Comparing BS and BSM we find that they have similar 
standard errors, and this is because the term a\j J, with some exceptions k, 
is smaller than BSM k for J = 25. 

Comparing standard errors without comparing execution times does not 
provide the full picture however, and for that reason we introduce a measure 
of precision per computational effort, defined as inverse variance over com- 
putation time. We refer to this measure as efficiency, and we can estimate 
it using inverse squared standard errors over measured computation times. 
The computation time of each method was measured using the function 
cputime in Matlab, the software used for all simulations. Figure [3] plots 
these estimates. We see that BS is better than BSM, which in turn is better 
than GT and GTRB which perform about equally. The exception is the last 
few time-points for which GTRB, which is fast, does very well. The ratios 
of efficiencies for BS vs. GT (for all k) ranges from 0.19 to 30.7, with 48 
(out of 50) of them being larger than one and their geometric mean being 
5.4. For BSm vs. GT the corresponding figures are 0.04, 11.4, 36 and 1.8 
respectively. 



500 
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Figure 3: Estimated efficiency for estimating E[X^ | yi :n ] for the methods GT 
(x), GTRB (o), BS with J = 25 trajectories (+), BS with J = 7 trajectories 
(A) and BSM (*). The y-axis is truncated; GTRB reaches efficiencies about 
900 s _1 for the final time points. 
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Having said that, we remark that figures like these crucially depend on 
software and implementation. GTRB is fast because Mat lab does vectorised 
operations quickly, and we also believe that BS has a slight disadvantage 
from slow random number generation in Matlab. In addition the resolution 
of cputime appears to be 10 ms, which may not be short enough to provide 
accurate measures of execution times (as we measured the time of each call 
to functions performing the various methods). 

As discussed in Section [373^ we can choose the number J of trajectories 
sampled in the BS method to achieve the best variance/cost performance. 
This number varies quite a bit with respect to k however, with the minimum 
value of J op t (viewed as a continuous variable) being 0.74 and the largest 
18.9. As a compromise we chose the geometric mean 6.98, rounded to J = 7 
(the arithmetic mean is 8.12). The estimated efficiency for this J is also 
plotted in Figure [3j and we see that there is improvement over J = 25 that 
is mostly marginal, but for some k notable. The ratios of efficiencies for this 
optimised BS vs. GT range from 0.50 to 37.6, with 49 (out of 50) of them 
being larger than one and their geometric mean being 7.5. 

A Proofs 

A.l Proof of Theorem [D 

To prove the statement we simply carry through the marginalisation. Thus 
let d = {Ck)i<i<N,i^e and define analogously; the marginal of 7r^ with 
respect to (£o°> ■■■■> £n ?l > ^o, - - - j Jn) is then obtained by integrating ir^ over 
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(ifc, £ k Jk )k =0 - We start with integrating over \ n and £ n Jn according to 

(£o> • • • iZn-l^tih, ■ ■ ■ ,in-l,JO, • • • , jn) 

= ^2j n n (£o> ' ' ' . £n> il, • • • . in, JO, • • • , Jn) c?C >l 



> • • • ,sn J w / "I T ^fc_igUfc_i,?fc j 
,fc=l Z^£=i w fe-i9l?fc_i,? fc 



j:n|nVS0 

— X 



iV n + 1 



^ p (^°)nLinf(if,ef) r 




^n-lg(C-l^n") 

E£d«4-i9(d.i.^; 



In the expression above, 



and 



\ ~> f Vff(£o, ■ ■ ■ , ^n, il, • • • , in.) j£-jn 



_ ^Pn_ (£o , • • • ,£n-l, ■ • • , in-1, 

Combining (|A.l[) - (|A.3j) yields 

^(^0, • • • ,^n-l,Cn",il, • • • ,in-l,JO, • • • , jn) 

n-1? !lj • • • j In— lj 

X 



(A.l) 



•in Z J fcl W n-l ( /^n-l,Sn J 



(A-3) 



y n fi q>Ia£v&) (A4) 



Now, by integrating (jA.4j) with respect to (i n _i,£ n f^ 1 ) and repeating the 
same procedure for (i n _2, £ n -2~ 2 ), ■ • • , (ii,^i" Jl ) an d finally £q Jo we obtain 



23 



(£30 £jn\ 



the marginal density 

. . ,e,io, ••••./,; " u:n|nv ; u m+ r - ' • (a.5) 

Finally, for any rectangle A = A x A\ x • • • x A n in V 

= E/ • / ^(^°---e,JO,...,in)^ O --^e=0O:n|n(^), 

implying that these measures are identical on £(X)®( n+1 ). We complete 
the proof by noting that the arguments above apply independently of the 
particle sample size N. 

A. 2 Proof of Theorem d 

It is enough to prove that 

def T^n (£p > • • • i^m ^1 ; • • • i ^n, Jp , ■ ■ ■ , Jn) _ %n 
kn(£o> ■ ■ ■ > £rn III • • • 5 1») Jo, ■ ■ ■ Jn) %n 

where is defined in (|3.3p . Using that 

j J k j J k 

yjN ( jJk C^k\ _ ^k — l k— 1 / £^k^ C^k\ 

we obtain 

K _ 0o : „.in(Co J %--- ) eonLi{9(^le fe Jfc )Ef = i^LiCi}Ef=i^ 



(A.6) 



Now, by the definition (|3.2p of the importance weights, 

4 k 4-i r k-^tvtk) = gk&mt^) (A.7) 

and plugging the identity (|A.7|) into ()A.6|) gives, using that, in addition, 

Po(£o Jo H J ° = p(eo°)9o(e Jo ), 

ji ^n.|n(e J °, ■ ■ ■ , j J n) E<=1 ^ Wl=l E/=l "tl^-l 

^ + V(e Jo )9o(eo Jo ) nLiCsfcC^M^.fif*)} 
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Finally, we conclude the proof by noting that 

n 

<h^M\- ■ ■ = P^Mtf) R {9k (£ h Mttl - £ )> ■ 

fc=l 
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